Python for audio processing
Contents
Python for audio processing#
All code-related materials in this tutorial are based in Python. There are several relevant courses and learning materials you may refer to for learning the basics of Python for audio signal processing. We want to highlight the course in Coursera on Audio Signal Processing for Music Applications and AudioLabs-Erlangen FMP Notebooks.
In this section we provide an overview of the very basics of Python for audio signal processing, hoping that it serves as a useful entry point to this tutorials for readers that do not have a lot of experience on that particular topic. If you are familiar with the said topic, you may move on.
Discretizing audio signals#
In a real-world scenario, sound is produced by pressure waves, air vibrations that are processed by our ears and converted into what we hear. From a computational point of view, we have to convert these signals into a measurable representation that a machine is able to read and process. In other words, these signals must be digitalized or discretizing the continuous (or analog) sound into a representable sequence of values. See an example in Fig. 1.
Fig. 1 We represent the continuous signal using a sequence of points (image from sonimbus.com).#
Note
The time-step between points is given by the sampling frequency (or sampling rate), which is a variable given in Hertz (Hz), and that indicates us how many samples per seconds we are using to represent the continuous signal. Common values for music signals are 8kHz, 16kHz, 22.05kHz, 44.1kHz, and finally 48kHz.
The important thing here is: these captured discrete values can be easily loaded and used in a Python project. We can use several different libraries and pipelines to load and process audio signals in a Python environment. As you may be expecting, the computational analysis of Indian Art Music builds on top of representations of the audio signals from which we extract the musicologically relevant information and therefore, it is important to understand how we handle these data.
Loading an audio signal#
We will load an audio signal to observe how it looks like in a Python development scenario. We will use Essentia, an audio processing library which is actually written in C++ but is wrapper in Python. Therefore it is fast! Let’s first install Essentia to our environment.
%pip install essentia
Let’s now import the Essentia standard module, which includes several different algorithms for multiple purposes. Check out the documentation for further detail.
import essentia
import essentia.standard as estd
# Let's also import util modules for data processing and visualisation
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
We will now use the MonoLoader function, which as the name suggests, loads an audio signal in mono.
Note
As you may know, audio signals can be represented in two channels, with different information in each of them. This is typically used in modern music recording systems, to mix and pan the different sources away from the center.
audio = estd.MonoLoader(filename="../audio/sharanu_janakana.mp3")()
print(np.shape(audio))
(18617951,)
By default, MonoLoader() resamples the input audio to 44.1Hz. Having this in mind, we can compute the actual duration of the piece in seconds.
import datetime
record_seconds = len(audio) / 44100
str(datetime.timedelta(seconds=record_seconds))
'0:07:02.175760'
Our variable audio is now a one-dimensional array of values, each of these values representing the points we have used to discretize the analog signal.
Important
Most signal processing libraries load the audio signals in the sampling rate originally used when the audio was converted to digital. However, we may be able to resample the audio at loading time, if needed.
audio_22050 = estd.MonoLoader(filename="../audio/sharanu_janakana.mp3", sampleRate=22050)()
print(np.shape(audio))
print(len(audio) / len(audio_22050))
(18617951,)
2.0000001074232125
If we use sampling rate of 22.05kHz, the observe that the length in samples of the audio signal is exactly half the length of the signal sampled at 44.1Hz. That is because, we are using half the samples to represent a second of audio.
Visualising the audio signal#
Visualising (and also listening!) to our data is very important. Python can help you with that! Listening to our data within a Python environment can be achieved in many ways. If you are working with Jupyter notebooks, you can use IPython.display.
ipd.Audio("../audio/sharanu_janakana.mp3", rate=44100)
Let’s now visualize the waveform.
audio = estd.MonoLoader(filename="../audio/sharanu_janakana.mp3")()
plt.plot(audio)
plt.title("Waveform of Sharanu Janakana")
plt.show()
Representing audio in the frequency domain#
You might have heard about the Fourier transform. In a few words, it is basically a transformation in which a signal represented in the time domain is decomposed in the frequency components that are present in the signal. To understand that statement, it is important to note that an audio signal is formed by the combination of multiple sinusoids, which are oscillatory signals defined by the following formula:
In other words, each sinusoid is defined by a particular amplitude \(A\), frequency \(f\), and phase \(\phi\), and follows a sinusoidal behaviour, oscillating at the given frequency. The variable \(t\) represents the time vector, in which we are capturing or analyzing the signal. Let’s generate a sinusoid at frequency 440Hz.
Important
Be careful with the volume (btw given by the amplitude \(A\)) when listening to single sinusoids (aka pure tones). It might be harmful, specially if listened with headphones and for a long time.
t = np.arange(0, 5, 1/44100) # Duration: 5s, sampling rate: 44100Hz
sinusoid_440 = np.sin(2*np.pi*440*t + 0)
print("Playing sinusoid of 440Hz")
ipd.Audio(sinusoid_440, rate=44100)
Playing sinusoid of 440Hz
The Fourier’s Theorem basically states that any signal can be decomposed into an overlap of several of these sinusoids. Why is that relevant?. Let’s take a look at some different sinusoids.
sinusoid_220 = np.sin(2*np.pi*220*t + 0)
print("Playing sinusoid of 220Hz")
ipd.Audio(sinusoid_220, rate=44100)
Playing sinusoid of 220Hz
sinusoid_880 = np.sin(2*np.pi*880*t + 0)
print("Playing sinusoid of 880Hz")
ipd.Audio(sinusoid_880, rate=44100)
Playing sinusoid of 880Hz
Higher frequencies are high pitched, while lower frequencies sound lower to our ears. In practice, computing the frequencies that are present in a music signal provides a lot of relevant information to understand the signals in which we are working on. Within the context of computational analysis of a musical repertoire, the frequency domain is useful for many purposes, e. g. to understand the timbre of particular instruments, to capture the notes or melodies that are being performed, and many more. That is why, several approaches within the field of ASP and MIR build on top of this representation, which in many cases captures more information that the raw waveform.
Let’s now visualise, in practise, how a Fourier transform, also known as spectrum of a particular signal looks like:
# We sum the three sinusoids
signal = sinusoid_220 + sinusoid_440 + sinusoid_880
# We create the spectrum of the signal
window = estd.Windowing(type="hann")
spectrum = estd.Spectrum()
signal_spectrum = spectrum(window(signal))
xf = np.arange(0, len(signal_spectrum), 1) # x vector to represent the frequencies
plt.plot(xf, signal_spectrum)
plt.xlim(0, 1000) # Limiting view to area of interest
plt.title("Spectrum of a simple signal")
plt.show()
We use a window function here to produce a fade-in/fade-out effect and remove possible noises that might have appeared at the boundaries of the audio signal. In the sense, we may obtain a cleaner frequency analysis.
Note
We could also use estd.FFT(), with FFT standing for (Fast Fourier Transform). However, this function does not disregard the complex part generated by the transformation. For more information see the documentation.
Note that the peaks are showing up exactly where the frequencies of the considered sinusoids are. This representation is OK for this use case, however, trying to compute the spectrum for “Sharanu Janakana” will be not very useful, since a massive amount of different sinusoids will appear in a 7 minute performance. For that reason, in a computational musicology scenario, we need a representation that also considers the time dimension.
Time-frequency representation#
A spectrogram, commonly computed using the STFT (Short-Time Fourier Transform), is a time-frequency representation in which a chain of spectrums are calculated from a signal by sliding, with a particular hop size a window of particular window size (or frame size) which indicates the time section in which to apply the frequency decomposition. We always use window size > hop size, therefore there is an overlap of windows, which actually provides a smoother chaining and better frequency analysis. A particular window function is applied to each frame to ensure a more natural overlap. For our computational musicology studies, this representation is very important.